JCO Clinical Cancer Informatics
● American Society of Clinical Oncology (ASCO)
Preprints posted in the last 90 days, ranked by how well they match the content profile of JCO Clinical Cancer Informatics, based on 18 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is an above-average fit.
Lee, M. H.; Xiao, Y.; Li, X.; Klee, E.; Yang, P.; Sio, T.; Wang, L.; Cerhan, J. R.; Zong, N.
Background: Electronic health record (EHR)-based prognostic modeling is increasingly used in oncology, yet incorporating pharmacogenomic (PGx) knowledge derived from experimental systems into clinical prediction frameworks remains challenging. This gap is driven by fundamental mismatches between controlled drug-mutation assays and heterogeneous, incomplete real-world clinical data. Methods: We propose a representation transfer framework that integrates PGx embeddings learned from large-scale in vitro pharmacogenomic screens into patient-level EHR models. A frozen pharmacogenomic encoder is used to generate interaction-aware embeddings from patient mutation profiles and administered therapies, which are aggregated into a fixed-length PGx Complementarity Representation. These representations are incorporated into multimodal survival prediction models alongside standard clinical features. Performance was evaluated using systematic modality ablation analyses, attribution analyses, and exploratory unsupervised representation analyses. Results: Integrating PGx embeddings yielded consistent performance improvements across all evaluated modality combinations. Relative gains were largest in modality-sparse settings, where baseline EHR features encode limited biological context, and were attenuated, but remained significant, in biologically enriched configurations. Attribution analyses indicated that PGx embeddings contributed non-redundant predictive signal beyond standard clinical features. Exploratory unsupervised analyses further demonstrated that the learned representations exhibit interpretable association patterns aligned with known therapeutic exposures and pathway-level associations. Conclusion: These findings suggest that externally learned pharmacogenomic representations can be transferred into real-world EHR models as a context-dependent, non-redundant augmentation.
By framing PGx knowledge as an interaction-aware representation rather than a mechanistic model, this work provides an informatics framework for integrating experimental pharmacogenomic data into clinical prediction tasks in a reproducible and interpretable manner.
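The aggregation step this abstract describes (frozen encoder over mutation-drug pairs, pooled into a fixed-length vector) can be illustrated with a minimal sketch. The encoder below is a stand-in, not the paper's actual model, and mean pooling is one plausible aggregation operator; the abstract does not specify which is used.

```python
# Sketch of building a fixed-length "PGx Complementarity Representation"
# from per-pair embeddings. frozen_pgx_encoder is a hypothetical stand-in
# for a pretrained, frozen pharmacogenomic encoder.

def frozen_pgx_encoder(mutation, drug, dim=8):
    """Stand-in encoder: deterministic pseudo-embedding in [0, 1]^dim."""
    seed = hash((mutation, drug)) & 0xFFFF
    return [float((seed >> i) & 1) for i in range(dim)]

def pgx_complementarity_representation(mutations, drugs, dim=8):
    """Mean-pool embeddings over all mutation-drug pairs -> fixed length,
    regardless of how many mutations or therapies a patient has."""
    pairs = [(m, d) for m in mutations for d in drugs]
    if not pairs:
        return [0.0] * dim  # patients with no usable pairs get a zero vector
    embs = [frozen_pgx_encoder(m, d, dim) for m, d in pairs]
    return [sum(col) / len(embs) for col in zip(*embs)]

vec = pgx_complementarity_representation(["KRAS_G12D", "TP53_R175H"], ["gemcitabine"])
```

The fixed-length output is what lets the representation be concatenated with standard clinical features in a multimodal survival model.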
Petalcorin, M. I. R.
Background: Early-phase oncology development increasingly depends on integrated interpretation of clinical outcomes, translational biomarkers, and pharmacokinetic exposure rather than toxicity alone. This shift has created a need for reproducible analytical workflows that can combine heterogeneous trial data into traceable, analysis-ready outputs suitable for exploratory review and early decision support. Objective: To develop a reproducible Python-based workflow that simulates a plausible early-phase oncology study, integrates clinical, biomarker, and pharmacokinetic data, and generates analysis-ready datasets, visual summaries, and exploratory predictive models relevant to early development analytics. Methods: A workflow was constructed to simulate an early-phase oncology cohort of 120 patients distributed across multiple dose levels. Three synthetic raw data sources were generated, including patient-level clinical data, baseline biomarker data, and longitudinal pharmacokinetic profiles. These sources were merged into a single analysis-ready dataset containing derived variables such as tumor percent change from baseline, clinical-benefit status, exposure summaries, adverse-event indicators, and survival outcomes. The workflow produced structured tables, patient listings, waterfall plots, Kaplan-Meier-style survival curves, biomarker-response visualizations, pharmacokinetic profile plots, and exploratory machine-learning outputs. Results: The final integrated dataset contained 120 patients and 30 variables. Median survival across the simulated cohort was 243.8 days, and higher dose groups showed improved median survival and greater clinical benefit relative to the low-dose group. Clinical benefit increased from 8.6% in the low-dose group to 29.0% in the medium-dose group and 45.2% in the high-dose group. 
Higher baseline LDH, CRP, and ctDNA fraction tracked with less favorable tumor-response trajectories, whereas higher exposure, reflected by AUC and Cmax, was associated with improved disease control. Pharmacokinetic profiles showed clear dose-dependent separation. Grade 3 or higher adverse-event rates remained within a plausible exploratory range across dose groups. A random-forest model for clinical benefit achieved an exploratory ROC AUC of 0.845, while a logistic-regression model for strict responder status could not be fit because no simulated patient met the prespecified objective response threshold. Conclusions: This proof-of-concept demonstrates that a transparent Python workflow can generate a coherent early-phase oncology analytical ecosystem from synthetic inputs. The workflow supports integration of heterogeneous data streams, derivation of analysis-ready variables, production of interpretable outputs, and exploratory modeling in a reproducible framework. Although the simulated responder prevalence was too low to support objective response modeling, this limitation itself highlights the importance of simulation calibration for downstream analytical validity. The framework provides a practical health informatics demonstration of how early oncology trial data can be structured and analyzed for exploratory translational decision support.
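The median-survival figures reported above come from Kaplan-Meier-style analysis of simulated data. A minimal pure-Python sketch of that pattern, using assumed exponential event and censoring times rather than the workflow's actual data generators:

```python
import random

def km_median(times, events):
    """Kaplan-Meier median: the first event time at which the
    product-limit survival estimate drops to <= 0.5 (None if not reached)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    s = 1.0
    for i in order:
        if events[i]:                      # observed event at this time
            s *= (at_risk - 1) / at_risk   # product-limit update
        at_risk -= 1
        if events[i] and s <= 0.5:
            return times[i]
    return None

random.seed(0)
times, events = [], []
for _ in range(120):                 # cohort size matching the abstract
    t = random.expovariate(1 / 250)  # assumed event time, mean 250 days
    c = random.expovariate(1 / 1000) # assumed light censoring
    times.append(min(t, c))
    events.append(t <= c)

med = km_median(times, events)
```

With light censoring, the estimated median lands near the true exponential median (250 × ln 2 ≈ 173 days); the exact value depends on the simulated draws.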
Dickerson, J. C.; McClure, M. B.; Shaw, M.; Reitsma, M. B.; Dalal, N. H.; Kurian, A. W.; Caswell-Jin, J. L.
Background: Manual chart abstraction is a major bottleneck in clinical research. In oncology, important outcomes such as disease recurrence and treatment history are often documented only in clinical notes, limiting the scale and quality of observational and epidemiologic studies. We developed an open-source pipeline that, in a HIPAA-compliant setting, can use any commercially available large language model (LLM) to determine whether variables from complex longitudinal oncology records can be abstracted with performance similar to that of expert medical oncologists. Methods: We randomly selected 100 patients from an institutional breast cancer cohort enriched for complex care. We abstracted a range of key variables from unstructured data, including dates of diagnosis and recurrence, clinical stage, biomarker subtype, genetic testing results, and prescribed systemic therapies, including treatment timing, intent, and reason for discontinuation. The inputs to the LLM were unnormalized, unlabeled, and unedited clinical notes, pathology reports, medication administration records, and demographics. Breast oncologists abstracted the same variables to create the reference standard. For systemic therapy extraction, a second oncologist and research coordinators served as comparators. In addition to variable-level performance, we examined whether survival and hazard-ratio estimates were similar for fully LLM-derived datasets compared with expert-derived datasets. Results: Among 100 patients, the median chart had more than 3,100 pages of text; patients received a median of 7 lines of therapy over 6.5 years of follow-up. The best-performing LLM achieved 99% concordance with the expert for recurrence status, 100% for germline BRCA1/2 pathogenic variant detection, 99% for hormone receptor status, 96% for HER2 status, 91% for clinical stage, 91% for PIK3CA mutation status, and 90% for ESR1 mutation status.
For anti-cancer drug extraction, the best-performing LLM approached inter-oncologist variability. For exact therapy-line reconstruction, mean patient-level performance remained 9 percentage points lower than the second oncologist, although inter-LLM disagreement was similar to inter-oncologist disagreement. All four LLMs tested outperformed the research coordinators on systemic therapy abstraction. Recurrence-free survival, overall survival, and hazard ratio estimates were similar between expert-derived and LLM-derived datasets. In an external cohort of 97 young patients with early-stage breast cancer, the unmodified pipeline showed similar performance for recurrence detection and adjuvant endocrine therapy use. Conclusions: Off-the-shelf LLMs in a fixed retrieval pipeline were able to abstract a range of variables from complex longitudinal oncology records with performance approaching inter-oncologist variability for key tasks, without any fine-tuning or institution-specific retraining. This approach offers a practical path to scaling the creation of research-grade retrospective datasets from narrative medical records.
Soltanifar, M.; Portuguese, A. J.; Jeon, Y.; Gauthier, J.; Lee, C. H.
Oncology research and clinical practice in North America increasingly rely on complex endpoints, heterogeneous study designs, and high-dimensional molecular data. In this landscape, data visualization serves as a critical analytic instrument for study design communication, model diagnostics, safety reporting, and real-time clinical decision support. Despite its importance, the oncology visualization ecosystem remains fragmented across commercial platforms and bespoke scripts, lacking a unified, code-first reference that emphasizes reproducibility and auditability in the R programming environment. This paper addresses this gap by presenting a North American collaborative atlas of 62 oncology visualization templates: 24 for clinical trials, 12 for real-world evidence (RWE), and 26 common to both settings. A core innovation of this atlas is its simulation-driven approach; each plot is illustrated using transparent, reproducible data-generating mechanisms. This allows users to deterministically recreate figures and easily adapt templates to alternative endpoints, censoring patterns, and subgroup structures. The paper provides foundational notation for oncology endpoints, an operational taxonomy based on data geometry, and a consolidated review of relevant R software. We further synthesize the practical utility of these methods through four representative case studies and provide a comparative analysis of the strengths, limitations, and future challenges of oncology data visualization. A detailed tutorial on fishplot is included to demonstrate a publication-ready workflow for clonal evolution.
Windisch, P.; Weyrich, J.; Dennstaedt, F.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: Large language models (LLMs) are used for biomedical text processing, but individual decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable "show your work" quote affects accuracy, stability, and verifiability for trial eligibility-scope classification from abstracts. Methods: We used 200 oncology randomized controlled trials (2005-2023) and provided models with only the title and abstract. Trials were labeled with whether they allowed for the inclusion of patients with localized and/or metastatic disease. Three flagship models (GPT-5.2, Gemini 3 Flash, Claude Opus 4.5) were queried with default settings in two independent conditions: label-only and label plus a verbatim supporting quote. Models could abstain if they deemed the abstract not to contain sufficient information. Each condition was repeated three times per abstract. Quotes were mechanically validated as exact substrings after whitespace normalization, and a separate judge step used an LLM to rate whether each quote supported the assigned label. Results: Evidence requirements modestly reduced coverage (GPT-5.2 86.2% to 84.3%, Gemini 98.3% to 92.8%, Claude 96.0% to 94.5%) by increasing abstentions and, for Gemini, invalid outputs. Conditional macro-F1 remained high but varied by model (slight gains for GPT-5.2 and Gemini, a decrease for Claude). Labels were stable across repetitions (Fleiss kappa 0.829 to 0.969). Mechanically valid quotes occurred in 83.3% to 91.2% of runs, yet only 48.0% to 78.8% of evidence-bearing predictions were judged semantically supported. Restricting to supported predictions increased macro-F1 at the cost of lower coverage. Conclusion: Substring-verifiable quotes provide an automated audit trail and enable selective, higher-trust automation when applying LLMs to biomedical text processing. However, this approach introduces new failure modes and trades coverage for verifiability in a model-dependent way.
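The mechanical quote check described in the methods (exact substring match after whitespace normalization) is straightforward to implement; a minimal sketch:

```python
import re

def quote_is_valid(quote, abstract):
    """Return True if the model's quote is a non-empty exact substring of
    the abstract once runs of whitespace are collapsed to single spaces."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip()
    q = norm(quote)
    return q != "" and q in norm(abstract)

abstract = "Patients with localized\n  or metastatic disease were eligible."
assert quote_is_valid("localized or metastatic disease", abstract)
assert not quote_is_valid("only localized disease", abstract)
```

Note that this check only verifies provenance; as the results show, a mechanically valid quote is not guaranteed to semantically support the assigned label, which is why a separate judge step is needed.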
Salome, P.; Knoll, M.; Walz, D.; Cogno, N.; Dedeoglu, A. S.; Qi, A. L.; Isakoff, S. J.; Abdollahi, A.; Jimenez, R. B.; Bitterman, D. S.; Paganetti, H.; Chamseddine, I.
Introduction: Manual data extraction from unstructured clinical notes is labor-intensive and impractical for large-scale clinical and research operations. Existing automated approaches typically require large language models, dedicated computational infrastructure, and/or task-specific fine-tuning that depends on curated data. The objective of this study is to enable accurate extraction with smaller locally deployed models using a disease-site-specific pipeline and prompt configuration that are optimized and reusable. Materials/Methods: We developed OncoRAG, a four-phase pipeline that (1) generates feature-specific search terms via ontology enrichment, (2) constructs a clinical knowledge graph from notes using biomedical named entity recognition, (3) retrieves relevant context using graph-diffusion reranking, and (4) extracts features via structured prompts. We ran OncoRAG using Microsoft Phi-3-medium-instruct (14B parameters), a mid-size language model deployed locally via Ollama. The pipeline was applied to three cohorts: triple-negative breast cancer (TNBC; 104 patients, 42 features; primary development), recurrent high-grade glioma (RiCi; 191 patients, 19 features; cross-lingual validation in German), and MIMIC-IV (100 patients, 10 features; external testing). Downstream task utility was assessed by comparing survival models for 3-year progression-free survival built from automatically extracted versus manually curated features. Results: The pipeline achieved mean F1 scores of 0.80 ± 0.07 (TNBC; 44 patients, 42 features), 0.79 ± 0.12 (RiCi; 61 patients, 19 features), and 0.84 ± 0.06 (MIMIC-IV; 100 patients, 10 features) on test sets under the automatic configuration. Compared to direct LLM prompting and naive RAG baselines, OncoRAG improved the mean F1 score by 0.19 to 0.22 and 0.17 to 0.19, respectively. Manual configuration refinement further improved the F1 score to 0.83 (TNBC) and 0.81 (RiCi), with no change in MIMIC-IV.
Extraction time averaged 1.7-1.9 seconds per feature with the 14B model. Substituting a smaller 3.8B model reduced extraction time by 57%, with a decrease in F1 score of 0.03-0.10. For TNBC, extraction time was reduced from approximately two weeks of manual abstraction to under 2.5 hours. In an exploratory survival analysis, models using automatically extracted features showed a C-index comparable to those with manual curation (0.77 vs 0.76; 12 events). Conclusions: OncoRAG, deployed locally using a mid-size language model, achieved accurate feature extraction from multilingual oncology notes without fine-tuning. It was validated against manual extraction for both retrieval accuracy and survival model development. This locally deployable approach, which requires no external data sharing, addresses a critical bottleneck in scalable oncology research.
Xu, S.; Wang, Z.; Wang, H.; Ding, Z.; Zou, Y.; Cao, Y.
Online cancer peer-support communities generate large volumes of patient-authored and caregiver-authored text that may reflect distress, coping, and informational needs. Automated emotional tone classification could support scalable monitoring, but supervised modeling depends on label quality and may benefit from explicit context features. Using the Mental Health Insights: Vulnerable Cancer Survivors & Caregivers dataset, we compared five model families (TF-IDF Logistic Regression, Random Forest, LightGBM, GRU, and fine-tuned ALBERT) on a three-class target (Negative/Neutral/Positive) derived from four original categories. We introduced two extensions: (i) LLM-based annotation to generate parallel "AI labels" and (ii) token-based augmentation that prepends LLM-extracted structured variables (reporter role and cancer type) to the post text. Models were trained with a 60/20/20 stratified train/validation/test split, with hyperparameters selected on validation data only. Test performance was summarized using weighted F1 and macro one-vs-rest AUC with bootstrap confidence intervals, with paired comparisons based on McNemar tests and false discovery rate adjustment. The LLM annotator produced substantial redistribution in the four-class label space, shifting prevalence toward very negative relative to the original labels; the shift persisted but attenuated after collapsing to three classes. Across all model families, token augmentation improved held-out performance, with the largest gains for GRU and consistent improvements for ALBERT. Augmentation also reduced polarity-reversing errors (Negative ↔ Positive) for ALBERT, while adjacent errors (Negative ↔ Neutral) remained the dominant residual failure mode.
These results indicate that LLM-based supervision can introduce systematic measurement shifts that require auditing, yet LLM-extracted context incorporated via simple token augmentation provides a pragmatic, model-agnostic mechanism to improve downstream emotional tone classification for supportive oncology decision support. Author summary: We studied how to better monitor emotional tone in posts from online cancer peer-support communities, where patients and caregivers share experiences that may signal distress, coping, or unmet needs. Automated classification could help organizations and moderators identify when additional support may be needed, but these systems depend on the quality of the labels used for training and may miss clinical context. Using a public dataset of cancer survivor and caregiver posts, we trained and compared several machine-learning and deep-learning models to classify each post as negative, neutral, or positive. We tested two practical improvements. First, we used a large language model to generate an additional set of "AI labels" and examined how these differed from the original categories. Second, we extracted simple context information (whether the writer was a patient or caregiver, and what cancer type was mentioned) and added this context to the text before model training. We found that adding context consistently improved performance across model types. However, the AI-generated labels shifted class distributions, indicating that automated labeling can introduce systematic changes that should be audited. Overall, simple context extraction can make emotional tone monitoring more accurate and useful for supportive oncology decision support.
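The token-based augmentation this abstract describes (prepending extracted reporter role and cancer type to the post text) might look like the following sketch; the bracketed token format is illustrative, not the paper's actual scheme:

```python
def augment_with_context(post, role=None, cancer=None):
    """Prepend LLM-extracted structured variables as special tokens so a
    downstream classifier sees context alongside the raw post text.
    Token names ([ROLE=...], [CANCER=...]) are hypothetical."""
    tokens = []
    if role:
        tokens.append(f"[ROLE={role.upper()}]")
    if cancer:
        tokens.append(f"[CANCER={cancer.upper()}]")
    return " ".join(tokens + [post])

aug = augment_with_context("Scan results tomorrow, feeling anxious.",
                           role="caregiver", cancer="breast")
# -> "[ROLE=CAREGIVER] [CANCER=BREAST] Scan results tomorrow, feeling anxious."
```

Because the context is injected into the input string itself, the same augmentation works unchanged for TF-IDF models, GRUs, and transformer fine-tuning, which is what makes it model-agnostic.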
Petalcorin, M. I. R.
Background: Modern oncology development depends on integrating radiographic response, molecular biomarkers, treatment exposure, safety, and survival endpoints, yet access to well-structured patient-level trial data is often limited. Methods: We developed a synthetic, literature-informed phase II randomized oncology trial framework that followed the sequence Patient → Data → Dataset → Analysis → Tables/Figures → Decision. A cohort of randomized patients was simulated with baseline demographic and disease features, longitudinal tumor measurements, circulating tumor DNA, inflammatory and exploratory biomarkers, adverse events, treatment exposure, and survival outcomes. Raw source datasets were transformed into SDTM-like domains and ADaM-like analysis datasets, then analyzed for baseline characteristics, exposure, best overall response, survival, subgroup hazard ratios, longitudinal tumor and biomarker changes, exposure-response, and safety. Results: The treatment arm showed a coherent efficacy signal across multiple analytical layers. Treatment increased objective response and clinical benefit, reduced tumor burden over time, and prolonged survival. Median overall survival increased from 135 days in the control arm to 288 days in the treatment arm, with an approximate hazard ratio of 0.661 (95% CI, 0.480-0.911; p = 0.011). Median progression-free survival increased from 116 to 208 days, with an approximate hazard ratio of 0.601 (95% CI, 0.418-0.864; p = 0.006). Circulating tumor DNA showed a more favorable trajectory in treated patients and aligned directionally with radiographic and survival benefit. Safety analyses showed increased treatment-related toxicity, but the overall safety profile remained interpretable and compatible with continued development.
Conclusions: This study demonstrates that a synthetic, literature-informed oncology trial can reproduce a biologically plausible and analytically coherent efficacy-safety signal architecture across radiographic, molecular, and time-to-event endpoints, providing a decision-oriented prototype for translational oncology clinical data science. Keywords: synthetic clinical trial, oncology, ctDNA, Kaplan-Meier, biomarker, survival analysis, translational data science, ADaM, SDTM
Makani, A.
Medical oncology education faces a dual crisis: knowledge velocity that outpaces static curricula and large language model (LLM) risks--hallucination and automation bias--that threaten the fidelity of AI-assisted learning. We present Onco-Shikshak V7, an AI-native adaptive learning platform that addresses both challenges through a unified cognitive architecture grounded in learning science. The system replaces isolated educational modules with four authentic clinical workflows--Morning Report, Tumor Board, Clinic Day, and AI Textbook--each scaffolded by a nine-module pedagogy engine that integrates ACT-R activation dynamics (illness scripts), Item Response Theory (adaptive difficulty), the Free Spaced Repetition Scheduler (FSRS v4), Zone of Proximal Development (scaffolding), and metacognitive calibration training (Brier score). Six specialist AI agents--medical oncology, radiation oncology, surgical oncology, pathology, radiology, and oncology navigation--engage in multi-disciplinary deliberation with per-specialty retrieval-augmented generation (RAG) grounding across nine authoritative guideline sources including NCCN, ESMO, and ASTRO. The platform provides 18 clinical cases with decision trees across six cancer types, maps every interaction to 13 ACGME Hematology-Oncology milestones, and implements four closed-loop feedback mechanisms that connect session errors to targeted flashcards, weak domains to suggested cases, and all interactions to a persistent learner profile. Technical validation confirms algorithmic correctness across eight subsystems. To our knowledge, this is the first system to unify ACT-R, IRT, FSRS, ZPD, and metacognitive calibration in a single medical education platform. Formal learner evaluation via randomized controlled trial is planned.
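Of the pedagogy components this abstract lists, metacognitive calibration via the Brier score is the simplest to illustrate: it is the mean squared difference between a learner's stated confidence and the actual outcome. A minimal sketch with illustrative numbers (how the platform itself aggregates scores is not specified):

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and binary
    outcome: 0.0 is perfect calibration-plus-resolution, and always
    answering p=0.5 scores 0.25 regardless of the outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# A learner states confidence in each answer before seeing the key.
score = brier_score([0.9, 0.6, 0.2, 0.8], [1, 1, 0, 0])  # -> 0.2125
```

Tracking this value over sessions is one way a system can give learners feedback on overconfidence, which is the automation-bias risk the abstract highlights.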
Hughes, N.; Hogenboom, J.; Carter, R.; Norman, L.; Gouthamchand, V.; Lindner, O.; Connearn, E.; Lobo Gomes, A.; Sikora-Koperska, A.; Rosinska, M.; Pogoda, K.; Wiechno, P.; Jagodzinska-Mucha, P.; Lugowska, I.; Hanebaum, S.; Dekker, A.; van der Graaf, W.; Husson, O.; Wee, L.; Feltbower, R.; Stark, D.
Background: Population-based cancer registers (PBCR) are important for monitoring trends in cancer epidemiology, facilitating the implementation of effective cancer services. Adolescents and young adults (AYA) with cancer are a patient group with a unique set of needs. The utility of PBCR in AYA is limited by the lack of AYA-specific data items. STRONG AYA, an international multidisciplinary consortium, is addressing this through federated learning (FL) methodology and novel data visualisation concepts. A Core Outcome Set (COS) has been developed to measure outcomes of importance through clinical data and Patient Reported Outcomes (PROs). We describe how data from the Yorkshire Specialist Register of Cancer in Children and Young People (YSRCCYP), a PBCR in the UK, is being used within STRONG AYA and how the subsequent analyses can guide patient consultations. Methods: Data from the YSRCCYP were imported into a Vantage 6 node, from which FL analyses are performed along with data provided by other consortium members. The results are extracted into the PROMPT software and integrated into patient electronic healthcare records. Results: Healthcare professionals can view the results of individual PROs at various time points and in comparison to summary analyses carried out within the STRONG AYA infrastructure. Results can be filtered by age, disease, country, and stage. Conclusion: We have demonstrated how a regional PBCR can contribute to a pan-European infrastructure, with the resulting analyses viewed to enhance patient consultations. Such analyses have the potential to be used for research and policy-making, improving outcomes for AYA.
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, d. R.; Foerster, R.; Schroeder, C.
Purpose: Large language models (LLMs) can classify biomedical documents accurately, but strong performance does not prove they are using the supplied text rather than identifier-triggered parametric knowledge. We tested whether oncology trial-success classification reflects "reading" of abstract evidence or "remembering" of known trials. Methods: We used a corpus of 250 two-arm oncology randomized controlled trials from seven major journals (2005-2023) and asked the flagship models of three commercial vendors (OpenAI, Google, and Anthropic) to output a single label indicating whether the primary endpoint was met. For each trial we created five deterministic inputs: title+abstract (baseline), title-only, DOI-only, a counterfactual title+abstract with the primary endpoint outcome minimally flipped, and the same counterfactual title+abstract paired with the original DOI to induce an identifier-text conflict. Results: With the full title+abstract, models achieved near-ceiling performance (accuracy and F1 score 0.96-0.97) and high format adherence (97.2-100%). Performance degraded stepwise with content removal (title-only accuracy and F1 score 0.79-0.88, DOI-only 0.63-0.67), consistent with an above-chance identifier-driven signal. Under counterfactual results, models followed the edited evidence (accuracy and F1 score 0.96-0.99 against inverted labels). Adding the real DOI minimally affected GPT (accuracy and F1 score ≈ 0.99) but modestly reduced Gemini (accuracy and F1 score ≈ 0.97) and Claude (accuracy and F1 score ≈ 0.95), mainly via lower sensitivity. Conclusion: LLMs robustly track explicit endpoint statements in abstracts, yet identifiers can support above-chance predictions and occasionally compete with textual evidence. Progressive ablations plus counterfactual conflicts provide a practical, reproducible audit for grounding in biomedical LLM evaluations.
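The five deterministic inputs per trial can be sketched as a small constructor; the exact prompt formatting (for instance, how the DOI is attached) is an assumption, not the paper's actual template:

```python
def build_conditions(title, abstract, doi, cf_abstract):
    """Build the five ablation inputs for one trial. cf_abstract is the
    counterfactual abstract with the endpoint outcome minimally flipped;
    the last condition pairs it with the real DOI to create a conflict."""
    return {
        "title+abstract": f"{title}\n\n{abstract}",
        "title-only": title,
        "doi-only": doi,
        "cf-title+abstract": f"{title}\n\n{cf_abstract}",
        "cf-title+abstract+doi": f"DOI: {doi}\n{title}\n\n{cf_abstract}",
    }

conds = build_conditions(
    "Trial X: drug A versus placebo",          # illustrative trial
    "The primary endpoint was met.",
    "10.1000/exampledoi",                      # illustrative DOI
    "The primary endpoint was not met.",
)
```

Because each input is a deterministic function of the source record, the ablation grid is fully reproducible, which is what makes this usable as an audit rather than a one-off probe.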
Zia, M. K.; Plessinger, B.; Eng, K. H.; Flierl, A.; Wilbert, M.; Jans, K.; Whalen, P.; Mullin, S.; Ohm, J.; Singh, A. K.; Farrugia, M.; Morrison, C.; Darlak, C. J.; Seshadri, M.
The lack of interoperability among clinical and research data systems poses a significant barrier to cancer researchers interested in evaluating novel mechanistic hypotheses or translating innovative treatment strategies from the laboratory to the clinic. To address this gap, we developed an innovative, web-based data discovery, visualization, and analysis tool (nSight) that allows researchers to quickly and easily query clinical/research data and construct de-identified cancer cohorts. Guiding principles for development of the tool were ease of use, intuitiveness, self-service, and presentation of structured but de-identified data to the end user. nSight provides users with information on patient demographics, disease histology, diagnostic procedures and therapeutic interventions, and the timeline of disease progression/recurrence, along with available molecular profiling/sequencing data and indicators of participation in epidemiologic or lifestyle studies for specific cancer patient cohorts. The platform also allows users to obtain summary statistics based on demographic, histologic, and clinical factors as well as perform basic survival analysis using Kaplan-Meier curves between specific patient cohorts. nSight is an intuitive, user-friendly tool that enables visualization, integration, and analysis of multimodal clinical and research data without placing high technical demands or time constraints on researchers. The platform is designed for research feasibility assessment, cohort development, and retrospective data discovery, which in turn should help investigators identify potential study populations and explore novel hypotheses.
Windisch, P.; Koechli, C.; Dennstaedt, F.; Aebersold, D. M.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for biomedical trial-success classification across temperature and reasoning/thinking settings, and to assess whether single-run reporting is sufficient. Methods: We utilized 250 randomized controlled oncology trial abstracts labeled POSITIVE/NEGATIVE based on primary endpoint success. With a fixed prompt requiring exactly "POSITIVE" or "NEGATIVE", we evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures 0.0-2.0, and GPT-5.2 across reasoning-effort levels (none to xhigh) with an additional temperature sweep when reasoning was disabled. Each setting was run three times. Reproducibility was quantified with Fleiss κ across replicates, performance was summarized with F1 (per run and majority vote), and invalid-format outputs were recorded. Results: Gemini showed near-perfect agreement across settings (κ = 0.942-1.000), including perfect agreement at temperature 0. Invalid outputs were uncommon (0-1.5%). GPT-5.2 reproducibility was similarly high (κ = 0.984-0.995) with no invalid outputs. Performance remained stable (mean/majority-vote F1 = 0.955-0.971), and majority voting offered only marginal gains. Conclusion: For strict binary biomedical classification with tightly constrained outputs, both models were highly reproducible across common decoding and reasoning configurations, indicating that one run is often adequate while minimal replication provides a practical stability check.
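Fleiss' κ across three replicates, as used here to quantify reproducibility, can be computed directly from per-item label counts; a minimal sketch with illustrative counts (five abstracts, three runs each, two possible labels):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings is a list of per-item category counts,
    each row summing to the (constant) number of raters/replicates."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Per-item agreement P_i: fraction of agreeing rater pairs.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    # Chance agreement P_e from overall category proportions.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Columns: [POSITIVE count, NEGATIVE count] over 3 runs per abstract.
counts = [[3, 0], [0, 3], [3, 0], [2, 1], [0, 3]]
kappa = fleiss_kappa(counts)  # one disagreeing run lowers kappa below 1
```

A κ of 1.0 means every replicate agreed on every abstract; values in the 0.94-1.00 range reported above correspond to only a handful of items with a dissenting run.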
McPhaul, T.; Kreimeyer, K.; Baris, A.; Botsis, T.
Cancer data standardization requires converting unstructured pathology reports into structured registry variables, a mostly manual and resource-intensive task. We evaluated two automated extraction platforms: Brim Analytics, an LLM-based system that guides and orchestrates abstraction, and DeepPhe, an ontology-driven system. Using 330 pancreatic adenocarcinoma and 34 breast cancer pathology reports from Johns Hopkins Hospital, we assessed both under deployment-realistic conditions. Brim Analytics achieved high accuracy across seven registry variables in pancreatic cancer (mean 96.7%), including T stage (96.4%) and histologic grade (97.0%), with a 3.0 percentage point decline on breast cancer (mean 93.7%). DeepPhe performed comparably for N stage (96.4% pancreatic, 94.1% breast) but had notable T stage deficits (83.6% pancreatic, 70.6% breast). Per-report processing times averaged 0.9 s (Brim, pancreatic), 4.6 s (Brim, breast), 1.1 s (DeepPhe, pancreatic), and 3.5 s (DeepPhe, breast). These results indicate that LLM-based extraction can achieve high accuracy across cancer types and support automated data workflows.
Shim, K. B.
Pancreatic ductal adenocarcinoma (PDAC) remains one of the deadliest solid tumors and continues to face low treatment-trial participation, fragmented evidence workflows, and labor-intensive abstraction of unstructured clinical text. Existing oncology-focused language models show promise, but many depend on private institutional corpora, limiting reproducibility and practical reuse across centers. We present Onca, an open 9B dense model designed for four PDAC-relevant tasks: trial eligibility screening, case-specific clinical reasoning, structured pathology report extraction, and molecular variant evidence reasoning. Onca is fine-tuned from Qwopus3.5-9B-v3 with a single Unsloth BF16 LoRA adapter on 37,364 training rows drawn from openly available sources. The evaluation spans 11 panels and compares Onca against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unmodified Qwopus base. Onca achieves the strongest overall results on Trial Screening (81.6 F1), Clinical Reasoning (14.1 composite), Pathology Extraction (30.5 field exact-match), PubMedQA Cancer (68.3 macro-F1), and PubMedQA (66.5 macro-F1). The strongest gains appear in tasks closest to routine oncology workflow, especially trial review and pathology structuring. These findings suggest that clinically targeted pancreatic-cancer language models can be built from open data with competitive performance while remaining practical to train on a single workstation-scale GPU setup.
Weyrich, J.; Dennstaedt, F.; Foerster, R.; Schroeder, C.; Aebersold, D. M.; Zwahlen, D. R.; Windisch, P.
Purpose: Large language models (LLMs) offer significant potential for automating the classification of clinical trials by eligibility criteria. However, a critical question remains regarding the optimal input data: while abstracts provide a condensed, high-density signal, full-text articles contain a much higher volume of information. It remains unclear whether the additional signal found in full texts improves classification performance or whether the accompanying noise (in the form of thousands of words irrelevant to the question at hand in a complete manuscript) negatively affects the model's reasoning capabilities. Methods: GPT-5 was applied to classify 200 randomized controlled oncology trials from high-impact medical journals, labeling each according to whether patients with localized and/or metastatic disease were eligible for inclusion. Each trial was classified twice, once using only the abstract and once using the full text, and GPT-5's outputs were compared with the ground-truth labels established by manual annotation. Performance was assessed by calculating and comparing accuracy, precision, recall, and F1 score, and the McNemar test was used to assess the statistical significance of the differences between the two input formats. Results: For identifying trials including patients with localized disease, GPT-5 achieved an accuracy of 86% (95% CI: 81%-91%; F1 = 0.90) when using abstracts and 92% (95% CI: 88%-95%; F1 = 0.92) when using full texts (p = 0.027). Performance for detecting trials including patients with metastatic disease was comparably high, with accuracies of 99% (95% CI: 99%-100%; F1 = 1.00) based on abstracts and 98% (95% CI: 97%-100%; F1 = 0.99) based on full texts. Overall accuracy for assigning combined labels per trial increased from 86% (95% CI: 81%-91%) using abstracts to 92% (95% CI: 88%-95%) using full texts (p = 0.027).
Conclusion: Providing full-text articles to GPT-5 significantly improved the classification of trial eligibility criteria. These findings suggest that, for this task, the benefit of the additional signal contained within the full text outweighed the potential for performance degradation caused by increased noise. Full-text analysis appears particularly valuable for extracting specific eligibility criteria in oncology that are frequently omitted or not explicitly described in the abstract.
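The paired comparison above is the setting the McNemar test handles: only trials classified correctly by one input format but not the other enter the statistic. A self-contained exact version is easy to write (illustrative; the authors do not specify whether they used the exact or asymptotic form):

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on discordant pairs.

    b = items correct with full text only, c = items correct with abstract
    only (or vice versa); concordant pairs do not enter the statistic.
    """
    n = b + c
    if n == 0:
        return 1.0
    # Under H0 each discordant pair is a fair coin flip, so the smaller
    # count follows Binomial(n, 0.5); double the lower tail for two sides.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 200 trials and the reported 86% vs. 92% accuracies, a split such as 2 vs. 10 discordant trials already yields p below 0.05, consistent in spirit with the reported p = 0.027.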
Olgiati, S.; Santona, F.; Meloni, D.; Barabino, E.; Rossi, G.; Genova, C.; Grossi, F.; Heidari, N.
Background: Previous research has shown that radiomics-based machine learning models are promising precision medicine tools for lesion-level prediction of anti-PD-1 response in advanced non-small cell lung cancer (NSCLC), but their clinical implementation remains limited owing to poor generalizability and uncertainty about which features capture true biological signal and which merely reflect noise. We tested the performance and multi-center validity of a radiomics-based photonic quantum architecture trained on a feature space reduced using clinical knowledge and robust medical statistics. Methods: This study included 125 patients with 164 advanced NSCLC single lesions from 3 different hospitals (Train, Test 1, and Test 2) treated with anti-PD-1 monotherapy as first or second line. All patients underwent a baseline CT scan before the start of treatment; lesions were semi-automatically segmented and labeled "progressive" if their diameter increased by more than 10% and "non-progressive" if their diameter decreased by 10% or more over the following 6 months. From each CT scan we extracted 851 radiomic features with a METhodological RadiomICs Score (METRICS) of 86.1% (Excellent Quality Category; Table S1), of which 183 were identified as reliable based on previously published clinical research. We then trained 1 classical and 3 photonic quantum machine learning models in the Train hospital and tested their performance on unseen external datasets in Test hospitals 1 and 2. Aiming to explore quantum machine learning as a long-term technology for precision oncology, we used an ideal classical simulation of a photonic quantum architecture. This approach assumes perfectly functioning hardware, eliminating confounding physical noise to assess true theoretical performance. Crucially, by adapting the standardized MerLin template, we ensure our results are reproducible, fulfilling an essential requirement for evidence-based clinical research.
Findings: We found that only 2 of the 851 features were both reliable and robustly correlated with the target (p < 0.001). These 2 features were used to train the machine learning models. Across external validation datasets, the LEXGROUPING-6modes quantum architecture outperformed the classical MLP baseline in Test Hospital 1 (Average Precision 0.755 vs. 0.702) and matched its performance in Test Hospital 2 (0.670). Notably, all photonic quantum architectures exceeded the chance level defined by progressor prevalence (0.622 and 0.462, respectively). Interpretation: To our knowledge, this is the first study to test the external validity of radiomics-based photonic quantum architectures using an evidence-based, statistically significant reduced feature space. Crucially, demonstrating that a quantum architecture can outperform or match an optimized classical baseline represents a significant milestone. These findings validate the theoretical potential of quantum models to capture complex biological signals, supporting their future role as clinical decision support systems for NSCLC immunotherapy as both dedicated quantum algorithms evolve and physical hardware matures. Furthermore, we found supporting evidence that heavily reducing the feature space can improve generalizability without compromising performance. Future research is required to assess scalability to other clinical centers and to validate these models on physical photonic quantum processors under realistic hardware noise conditions.
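The two-stage feature reduction described here, keeping only features that are a priori reliable and also significantly correlated with the target, can be sketched with a permutation test for the correlation p-value. The function below is a generic illustration; the thresholds, the permutation-based significance test, and all names are assumptions, not the study's actual statistical pipeline:

```python
import numpy as np

def screen_features(X, y, reliable_idx, alpha=0.001, n_perm=2000, seed=0):
    """Keep features that are a priori reliable AND significantly
    correlated with the binary target, assessed by a permutation test."""
    rng = np.random.default_rng(seed)
    selected = []
    for j in reliable_idx:
        r_obs = abs(np.corrcoef(X[:, j], y)[0, 1])
        # Null distribution of |r| under random label permutations.
        r_null = np.array([
            abs(np.corrcoef(X[:, j], rng.permutation(y))[0, 1])
            for _ in range(n_perm)
        ])
        p = (1 + np.sum(r_null >= r_obs)) / (1 + n_perm)
        if p < alpha:
            selected.append(j)
    return selected
```

The `+1` terms give the standard add-one permutation p-value, so the smallest attainable p is 1/(n_perm + 1); `n_perm` must exceed 1/alpha for any feature to pass.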
Bouteiller, J.; Gryspeert, A.-R.; Caron, J.; Polit, L.; Altay, G.; Cabantous, M.; Pietrzak, R.; Graziosi, F.; Longarini, M.; Schutte, K.; Cartry, J.; Mathieu, J. R.; Bedja, S.; Boileve, A.; Ducreux, M.; Pages, D.-L.; Jaulin, F.; Ronteix, G.
Background: Predicting whether a treatment will demonstrate meaningful clinical benefit before committing to a large-scale trial remains a major unmet need in oncology. Patient-derived organoids (PDOs) recapitulate individual tumor drug sensitivity, but have not been used to forecast population-level trial outcomes. We developed SCOPE (Screening-to-Clinical Outcome Prediction Engine), a platform that integrates PDO drug screening with clinical prognostic modeling to predict arm-level median progression-free survival (mPFS) and objective response rate (ORR) without access to any trial outcome data. Patients and methods: SCOPE was trained on 54 treatment lines from patients with metastatic colorectal cancer (mCRC, n=15) and metastatic pancreatic ductal adenocarcinoma (mPDAC, n=39) with matched clinical data and PDO drug screening across 9 compounds. A Clinical Score module captures baseline prognosis; a Drug Screen Score module quantifies treatment-specific organoid sensitivity. To predict trial outcomes, synthetic patient profiles are generated from published eligibility criteria and matched to a biobank of 81 PDO lines. Predictions were externally validated against 32 arms from 23 published trials, treatment ranking was assessed across 8 head-to-head comparisons, and prospective applicability was tested for daraxonrasib (RMC-6236), a novel pan-RAS inhibitor in mPDAC. Results: Predicted mPFS strongly agreed with published outcomes (R2=0.85, MAE=0.82 months; Pearson r=0.92, P<0.001), approaching the empirical concordance between two independently measured clinical endpoints (ORR vs. mPFS, R2=0.87). ORR prediction was similarly robust (R2=0.71, MAE=7.3 percentage points). Integrating organoid and clinical data significantly outperformed either alone (P=0.001). SCOPE correctly identified the superior arm in 7 of 8 head-to-head comparisons (88%, P<0.05). 
Applied to daraxonrasib prior to phase 3 data availability, the platform predicted superiority over standard chemotherapy in KRAS-mutant mPDAC, consistent with emerging clinical data. Conclusion: By combining functional organoid drug screening with clinical modeling, SCOPE generates calibrated efficacy predictions for both established regimens and novel agents without prior clinical data. This approach could support clinical trial design, treatment arm selection, and go/no-go decisions, offering a new tool to improve the efficiency of gastrointestinal cancer drug development.
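The agreement statistics quoted above (R2 and MAE between predicted and published arm-level outcomes) can be reproduced from paired values. The sketch below computes R2 as the coefficient of determination; the paper may instead report squared Pearson r, and the helper name is illustrative:

```python
import numpy as np

def agreement_metrics(pred, obs):
    """Coefficient of determination (R^2) and mean absolute error
    between predicted and observed arm-level outcomes."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    ss_res = np.sum((obs - pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((obs - obs.mean()) ** 2)    # total sum of squares
    return 1 - ss_res / ss_tot, np.abs(obs - pred).mean()
```

Applied to 32 predicted vs. published mPFS values, this yields the single (R2, MAE) pair reported per endpoint.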
Sepulchre, E.; Rouette, A.; Freycon, C.; Witkowski, L.; Jammali, S.; Sontag, T.; Langlois, S.; Sultan, N.; Budd, C.; Lisi, V.; Richer, C.; Jouan, L.; Lepage, M.-E.; Reichmann, L.; Foulkes, W.; Laberge, A.-M.; Michon, B.; Brossard, J.; Jabado, N.; Sinnett, D.; Tran, T.-H.; Vairy, S.; Santiago, R.; Cellot, S.; Goudie, C.; Lavallee, V.-P.
Background: The province of Quebec has progressively implemented paired tumour-germline sequencing in paediatric oncology through two coordinated precision research programs, preceding a province-wide mainstream clinical genomics initiative. We report the prevalence, spectrum, and clinical relevance of germline findings (GFs) in children with primary extracranial cancers, integrating molecular, phenotypic, and pathological data. Methods: Patients enrolled between 2014 and 2022 underwent germline whole-exome sequencing (WES) using a virtual 352-cancer-gene panel. Sequencing, bioinformatics, and variant interpretation followed best-practice standards based on GATK, ACMG/AMP, and ClinGen recommendations. Somatic WES and transcriptomic data were integrated when available. GFs were categorised as diagnostic findings (DFs; established or suspected association with the cancer phenotype) or as other findings, further subcategorised according to actionability and age of disease onset. Findings: Among 484 children, 130 (26.9%) carried 149 GFs, including 49 (10.1%) with a DF (42 with well-established associations with cancer phenotypes). DFs involved 21 genes related to childhood cancer predisposition, trisomy 21, and one clinical Beckwith-Wiedemann syndrome. Six DFs were initially missed by standard exome pipelines, and mosaic constitutional cancer predisposition syndrome (CPS) was confirmed in 4/49 children, underscoring the value of integrative analyses. A CPS was known at the time of the primary cancer in 10/49 children. Among those diagnosed with a CPS after cancer onset, suggestive phenotypic features were present in 36/39. Other, non-diagnostic findings were identified in 92 children; 21 (4.3% of the cohort) had actionable implications in childhood (n=7) or adulthood (n=14).
Somatic sequencing was informative for refining causality: somatic second-hit alterations were identified in 29/33 (87.9%) DFs involving monoallelic tumour suppressor genes, whereas no such alterations were observed in non-DF counterparts (0/57; p<0.0001). Interpretation: This provincial research experience highlights the analytical and practical challenges of germline evaluation in paediatric oncology and supports a shift toward integrative interpretation frameworks combining complementary germline, somatic, pathology, and phenotypic data. Flexibility in investigative strategies and nuanced categorisation of findings are warranted, guided by a child-centred interpretative framework. This approach underpins Quebec's paediatric oncology genomics mainstreaming initiative.
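The second-hit enrichment above (29/33 DFs vs. 0/57 non-DFs) is the kind of 2x2 comparison a one-sided Fisher exact test handles; the abstract does not name its test, so the hypergeometric-tail sketch below is an illustration, not the authors' code:

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]],
    testing enrichment of the top-left cell: P(X >= a) under the
    hypergeometric null with the observed margins fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(n - row1, col1 - k) / denom
    return p
```

Plugging in the reported counts (a=29 second hits in 33 DFs, c=0 in 57 non-DFs) gives a p-value far below the quoted 0.0001.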
Jonnalagadda, P.; Obeng-Gyasi, S.; Stover, D. G.; Andersen, B. L.; Rahurkar, S.
Background: Many patients with triple-negative breast cancer (TNBC), particularly those who are older, Black, or insured by Medicaid, do not receive guideline-concordant treatment, despite its association with up to 4x higher survival. Early identification of patients at risk for rapid relapse may enable timely interventions and improve outcomes. This study applies machine learning (ML) to real-world data to predict risk of rapid relapse in TNBC. Methods: We trained various ML models (logistic regression, decision trees, random forests, XGBoost, naive Bayes, support vector machines) using National Cancer Database (NCDB) data and fine-tuned them using electronic health record (EHR) data from a cancer registry. Class imbalance was addressed using the synthetic minority oversampling technique (SMOTE). Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the receiver operating characteristic curve (ROC AUC), accuracy, and F1 score. Transfer learning, cross-validation, and threshold optimization were applied to enhance the ensemble model's performance on clinical data. Results: Initial models trained on NCDB data exhibited high NPV but low sensitivity and PPV. SMOTE and hyperparameter tuning produced modest improvements. External testing on EHR data from a cancer registry showed similar model performance. After applying transfer learning, cross-validation, and threshold optimization using the clinical data, the ensemble model achieved higher performance: sensitivity of 0.87, specificity of 0.99, PPV of 0.90, NPV of 0.98, ROC AUC of 0.99, accuracy of 0.98, and F1 score of 0.88. This optimized model, leveraging readily available clinical data, demonstrated superior performance compared with the initial NCDB-trained models and those reported in the extant literature.
Conclusions: Transfer learning and threshold optimization effectively adapted ML models trained on NCDB data to an independent real-world clinical dataset from a single site, producing a high-performing model for predicting rapid relapse in TNBC. This model, potentially translatable to Fast Healthcare Interoperability Resources (FHIR)-compatible workflows, represents a promising tool for identifying patients at high risk. Future work should include prospective external validation, evaluation of integration into clinical workflows, and implementation studies to determine whether the model improves care processes such as timely patient navigation and treatment planning. Author Summary: In this study, we set out to understand which patients with triple-negative breast cancer might experience a rapid return of their disease. Many people with this aggressive form of cancer do not receive the treatments that are known to improve survival, especially patients who are older, Black, or insured through public programs. Being able to identify those at highest risk early in their care could help health teams provide timely support and ensure that patients receive the treatments they need. To do this, we used information from a large national cancer database to build computer-based models that learn from patterns in patient data. We then refined these models using real medical records from a cancer center to make sure they worked well in everyday clinical settings. After adjusting and improving the models, we developed a tool that can correctly identify most patients who are likely to have a rapid return of their cancer. Our hope is that this type of tool could eventually be built into routine care and help guide timely follow-up, support services, and treatment planning. More testing in real clinical environments will be important to understand how well the tool improves care and outcomes for patients.
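Threshold optimization of the kind described here, choosing the probability cutoff that maximizes F1 on a validation set rather than defaulting to 0.5, can be sketched as follows. The helper and the toy values are illustrative, not the study's tuning procedure:

```python
import numpy as np

def best_f1_threshold(probs, labels):
    """Sweep candidate cutoffs and return the (threshold, F1) pair
    that maximizes F1 on the given validation scores."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(probs):          # each observed score is a candidate
        preds = probs >= t
        tp = np.sum(preds & (labels == 1))
        fp = np.sum(preds & (labels == 0))
        fn = np.sum(~preds & (labels == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Because F1 trades precision against recall, the chosen cutoff on an imbalanced relapse dataset is typically well below 0.5, which is exactly what makes this step matter after SMOTE and transfer learning.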